Preliminary Step: load example data

Read in (again) the gambling dataset:

gambling.data <- read.csv(file = "http://data.justice.qld.gov.au/JSD/OLGR/20170817_OLGR_LGA-EGM-data.csv",
                 header = TRUE,
                 sep = ",",
                 stringsAsFactors = FALSE)

# rename columns
names(gambling.data)[2] <- "Local.Govt.Area"
names(gambling.data)[7] <- "Player.Money.Lost"
names(gambling.data)[1] <- "Month.Year"

gambling.data <- na.omit(gambling.data)

# Add a day of month (1st) to each date string
date.string <- paste0("1 ", gambling.data$Month.Year)

# Convert to POSIXlt, a date-time format
gambling.data$Date <- strptime(date.string, format = "%d %B %Y")

# subset to Brisbane only
brisbane.only <- gambling.data[gambling.data$Local.Govt.Area == "BRISBANE", ]

# Choose only the Brisbane data from 2010
#row.indices <- (brisbane.only$Date >= "2010-01-01 AEST" &
#                brisbane.only$Date <= "2010-12-31 AEST")
#brisbane.2010.data <- brisbane.only[row.indices,]

Introduction to statistics with R

One of the really cool things about R, as opposed to other languages like Python, MATLAB, etc., is that it has such a wide user base in the statistics community, and many packages have been developed to analyze data with both old and new statistical techniques and models. Most of these (but not all) are available on CRAN. Although we won't cover much beyond the basics today (which could just as easily be done in most other languages), more advanced things like Bayesian techniques are also well supported by the R package ecosystem.

Descriptive Statistics

Measures of Central Tendency

All the standard summary statistics are available in R including:

  • mean
  • sd
  • var
  • min
  • max
  • median
  • range
  • quantile
x <- c(1,3,5,2,9,NA,7,10,15,4,7,7)
x
##  [1]  1  3  5  2  9 NA  7 10 15  4  7  7
mean(x)
## [1] NA

If there are missing values, we can ignore them by setting na.rm=TRUE:

mean(x, na.rm=TRUE)
## [1] 6.363636
median(x,na.rm = T)
## [1] 7
var(x, na.rm = T)     # variance
## [1] 16.25455
sd(x, na.rm = T)      # standard deviation
## [1] 4.031693
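As a quick sanity check, the standard deviation is just the square root of the variance:

```r
x <- c(1, 3, 5, 2, 9, NA, 7, 10, 15, 4, 7, 7)
all.equal(sd(x, na.rm = TRUE)^2, var(x, na.rm = TRUE))
## [1] TRUE
```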

In R, you can use the function quantile() to get the median and quartiles, or you can use the function summary() to get the quantiles as well as the mean and the number of NA values:

quantile(x, na.rm = T)
##   0%  25%  50%  75% 100% 
##  1.0  3.5  7.0  8.0 15.0
# mean,median,25th and 75th quartiles,min,max
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   3.500   7.000   6.364   8.000  15.000       1

Visual Summaries: Box plots and Histograms

You can get most of the same summary statistics visually from a boxplot (in base R graphics):

boxplot(x, horizontal = TRUE)

Or alternatively in ggplot:

library(ggplot2)

some.LGAs <- sample(unique(gambling.data$Local.Govt.Area), 5)

subset.data <- gambling.data[gambling.data$Local.Govt.Area %in% some.LGAs, ]

ggplot(data=subset.data,
       aes(x=Local.Govt.Area,y=Approved.EGMs))+
  geom_boxplot()+
  scale_y_log10()+
  coord_flip()+
  theme_minimal()

We can also look at a histogram of the data. To obtain a histogram, the scale of the variable is divided into consecutive intervals and the number of observations in each interval is counted.

hist(x)

Or plot a histogram in ggplot:

ggplot(data=subset.data,
       aes(x=Approved.EGMs,fill=Local.Govt.Area))+
  geom_histogram()+
  scale_x_log10()+
  theme_minimal()
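Under the hood, a histogram is just a count of observations per interval; here is a minimal sketch of the same counting, with hypothetical breaks every 5 units, using cut() and table():

```r
x <- c(1, 3, 5, 2, 9, NA, 7, 10, 15, 4, 7, 7)
bins <- cut(x, breaks = seq(0, 15, by = 5))   # three right-closed intervals
table(bins)                                   # counts per interval; the NA is dropped
## bins
##   (0,5]  (5,10] (10,15] 
##       5       5       1
```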

Correlation

x1=sin(1:100/3)
y1=sin(1:100/3)   # identical to x1, so perfectly (positively) correlated
plot(x1,y1)

cor(x1,y1)
## [1] 1
x2=rnorm(100)
y2=rnorm(100)
plot(x2,y2)

cor(x2,y2)
## [1] -0.1854688
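The correlation coefficient is just the covariance rescaled by the two standard deviations, which we can verify directly (the seed and the 0.5 slope here are hypothetical, chosen only for reproducibility):

```r
set.seed(42)                   # hypothetical seed, for reproducibility
x3 <- rnorm(100)
y3 <- 0.5 * x3 + rnorm(100)    # y3 depends partly on x3
all.equal(cor(x3, y3), cov(x3, y3) / (sd(x3) * sd(y3)))
## [1] TRUE
```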

Calculating correlations for many variables at once

#install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.2
## corrplot 0.84 loaded
MoneyDateLGA<-gambling.data[,c("Player.Money.Lost","Date","Local.Govt.Area")]

spreaddata <- tidyr::spread(MoneyDateLGA, key = Local.Govt.Area, value = Player.Money.Lost)

#remove the columns with NA values
spreaddata$TORRES<-NULL
spreaddata$LONGREACH<-NULL
spreaddata$BALONNE<-NULL

#plot corrplot
corrplot(cor(spreaddata[,-1]),
         order="hclust",
         tl.cex = .5, 
         tl.col = "black")

Pair plots

#install.packages("GGally")
library(GGally)

ggpairs(gambling.data[,-c(1,2)])

There are other functions and packages to obtain summary statistics of the datasets:

  • stat.desc() in the pastecs package
  • describe() in the Hmisc package
  • describe() in the psych package…

Simple Linear Regression

Simple linear regression is useful for examining or modelling the relationship between two variables.

ggplot(data=gambling.data,
       aes(x=Operational.EGMs,
           y=Player.Money.Lost,
           color=Local.Govt.Area,
           group=1))+
  guides(color=FALSE)+
  geom_point()+
  geom_smooth(method = "lm",se=TRUE)

Fit a linear regression and look at a summary of its results. The model is of the form \(y = mx + c\), where \(y\) = Player.Money.Lost and \(x\) = Operational.EGMs.

dollars_per_machine_model <- lm(Player.Money.Lost ~ Operational.EGMs, 
                                data = gambling.data)
summary(dollars_per_machine_model)
## 
## Call:
## lm(formula = Player.Money.Lost ~ Operational.EGMs, data = gambling.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9146097  -246066      408   153205 11721046 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.724e+05  1.557e+04  -17.49   <2e-16 ***
## Operational.EGMs  4.152e+03  8.386e+00  495.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1083000 on 6673 degrees of freedom
## Multiple R-squared:  0.9735, Adjusted R-squared:  0.9735 
## F-statistic: 2.452e+05 on 1 and 6673 DF,  p-value: < 2.2e-16
attributes(dollars_per_machine_model)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"
dollars_per_machine_model$coefficients 
##      (Intercept) Operational.EGMs 
##      -272416.501         4152.383
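Using the coefficients printed above, we can reproduce a prediction by hand from the \(y = mx + c\) form (a sketch for a hypothetical LGA with 100 operational machines):

```r
# coefficient values as printed above
intercept <- -272416.501
slope     <- 4152.383
slope * 100 + intercept   # predicted Player.Money.Lost for 100 machines
## [1] 142821.8
```

In practice you would use predict(dollars_per_machine_model, newdata = data.frame(Operational.EGMs = 100)), which does the same arithmetic.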
confint(dollars_per_machine_model)
##                        2.5 %      97.5 %
## (Intercept)      -302943.761 -241889.241
## Operational.EGMs    4135.944    4168.822
anova(dollars_per_machine_model)
## Analysis of Variance Table
## 
## Response: Player.Money.Lost
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Operational.EGMs    1 2.8758e+17 2.8758e+17  245187 < 2.2e-16 ***
## Residuals        6673 7.8268e+15 1.1729e+12                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# visualize residuals and fitted values.
plot(dollars_per_machine_model, 
     pch=16, 
     which=1)

hist(dollars_per_machine_model$residuals,
     breaks=1000,
     xlim=c(-2e6,2e6))

sd(dollars_per_machine_model$residuals)
## [1] 1082929

Multiple Linear Regression

We can make a much better model by including more variables. The functions we will use are exactly the same though.

dollars_model <- lm(Player.Money.Lost ~ Operational.EGMs + Operational.Sites + as.numeric(Date) + Local.Govt.Area, 
                                data = gambling.data)
summary(dollars_model)
## 
## Call:
## lm(formula = Player.Money.Lost ~ Operational.EGMs + Operational.Sites + 
##     as.numeric(Date) + Local.Govt.Area, data = gambling.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5480600  -179541   -11605   168530  6292338 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                       1.375e+06  1.315e+05  10.462  < 2e-16
## Operational.EGMs                  4.950e+03  1.338e+02  37.005  < 2e-16
## Operational.Sites                -2.200e+05  3.490e+03 -63.027  < 2e-16
## as.numeric(Date)                 -2.966e-04  8.724e-05  -3.400 0.000677
## Local.Govt.AreaBANANA             8.594e+05  8.058e+04  10.665  < 2e-16
## Local.Govt.AreaBRISBANE           3.862e+07  1.211e+06  31.896  < 2e-16
## Local.Govt.AreaBUNDABERG          6.010e+06  1.807e+05  33.263  < 2e-16
## Local.Govt.AreaBURDEKIN           2.324e+05  7.930e+04   2.931 0.003393
## Local.Govt.AreaCAIRNS             6.187e+06  2.298e+05  26.920  < 2e-16
## Local.Govt.AreaCASSOWARY COAST    1.542e+06  8.892e+04  17.341  < 2e-16
## Local.Govt.AreaCENTRAL HIGHLANDS  2.336e+06  9.475e+04  24.659  < 2e-16
## Local.Govt.AreaCHARTERS TOWERS    6.474e+05  7.995e+04   8.097 6.63e-16
## Local.Govt.AreaCLONCURRY          1.948e+04  7.843e+04   0.248 0.803842
## Local.Govt.AreaDOUGLAS            6.988e+05  7.950e+04   8.791  < 2e-16
## Local.Govt.AreaFRASER COAST       5.469e+06  1.806e+05  30.288  < 2e-16
## Local.Govt.AreaGLADSTONE          3.616e+06  1.114e+05  32.456  < 2e-16
## Local.Govt.AreaGOLD COAST         2.261e+07  7.475e+05  30.240  < 2e-16
## Local.Govt.AreaGOONDIWINDI        7.911e+05  8.018e+04   9.866  < 2e-16
## Local.Govt.AreaGYMPIE             2.494e+06  9.476e+04  26.321  < 2e-16
## Local.Govt.AreaHINCHINBROOK       5.139e+05  7.875e+04   6.526 7.25e-11
## Local.Govt.AreaIPSWICH            6.552e+06  2.098e+05  31.231  < 2e-16
## Local.Govt.AreaISAAC              2.219e+06  8.869e+04  25.021  < 2e-16
## Local.Govt.AreaLIVINGSTONE        1.182e+06  8.404e+04  14.060  < 2e-16
## Local.Govt.AreaLOCKYER VALLEY     2.386e+06  9.001e+04  26.510  < 2e-16
## Local.Govt.AreaLOGAN              8.916e+06  2.629e+05  33.917  < 2e-16
## Local.Govt.AreaLONGREACH         -8.728e+04  8.953e+04  -0.975 0.329631
## Local.Govt.AreaMACKAY             6.274e+06  2.213e+05  28.355  < 2e-16
## Local.Govt.AreaMARANOA            5.808e+05  7.916e+04   7.338 2.43e-13
## Local.Govt.AreaMAREEBA            1.569e+05  7.959e+04   1.971 0.048738
## Local.Govt.AreaMORETON BAY        1.127e+07  4.058e+05  27.779  < 2e-16
## Local.Govt.AreaMOUNT ISA          1.394e+05  9.255e+04   1.507 0.131985
## Local.Govt.AreaMURWEH             3.045e+05  7.863e+04   3.872 0.000109
## Local.Govt.AreaNOOSA              2.023e+06  1.014e+05  19.944  < 2e-16
## Local.Govt.AreaNORTH BURNETT      7.991e+05  7.952e+04  10.050  < 2e-16
## Local.Govt.AreaREDLAND            4.060e+06  1.752e+05  23.175  < 2e-16
## Local.Govt.AreaROCKHAMPTON        6.143e+06  1.680e+05  36.556  < 2e-16
## Local.Govt.AreaSCENIC RIM         2.661e+06  9.186e+04  28.969  < 2e-16
## Local.Govt.AreaSOMERSET           1.424e+06  8.175e+04  17.414  < 2e-16
## Local.Govt.AreaSOUTH BURNETT      2.485e+06  9.413e+04  26.401  < 2e-16
## Local.Govt.AreaSOUTHERN DOWNS     2.004e+06  9.396e+04  21.332  < 2e-16
## Local.Govt.AreaSUNSHINE COAST     1.138e+07  3.931e+05  28.945  < 2e-16
## Local.Govt.AreaTABLELANDS        -6.438e+03  7.874e+04  -0.082 0.934835
## Local.Govt.AreaTOOWOOMBA          8.595e+06  2.361e+05  36.411  < 2e-16
## Local.Govt.AreaTORRES             1.678e+05  4.807e+05   0.349 0.727023
## Local.Govt.AreaTOWNSVILLE         7.903e+06  2.260e+05  34.961  < 2e-16
## Local.Govt.AreaWESTERN DOWNS      3.432e+06  1.030e+05  33.304  < 2e-16
## Local.Govt.AreaWHITSUNDAY         2.948e+06  1.011e+05  29.164  < 2e-16
##                                     
## (Intercept)                      ***
## Operational.EGMs                 ***
## Operational.Sites                ***
## as.numeric(Date)                 ***
## Local.Govt.AreaBANANA            ***
## Local.Govt.AreaBRISBANE          ***
## Local.Govt.AreaBUNDABERG         ***
## Local.Govt.AreaBURDEKIN          ** 
## Local.Govt.AreaCAIRNS            ***
## Local.Govt.AreaCASSOWARY COAST   ***
## Local.Govt.AreaCENTRAL HIGHLANDS ***
## Local.Govt.AreaCHARTERS TOWERS   ***
## Local.Govt.AreaCLONCURRY            
## Local.Govt.AreaDOUGLAS           ***
## Local.Govt.AreaFRASER COAST      ***
## Local.Govt.AreaGLADSTONE         ***
## Local.Govt.AreaGOLD COAST        ***
## Local.Govt.AreaGOONDIWINDI       ***
## Local.Govt.AreaGYMPIE            ***
## Local.Govt.AreaHINCHINBROOK      ***
## Local.Govt.AreaIPSWICH           ***
## Local.Govt.AreaISAAC             ***
## Local.Govt.AreaLIVINGSTONE       ***
## Local.Govt.AreaLOCKYER VALLEY    ***
## Local.Govt.AreaLOGAN             ***
## Local.Govt.AreaLONGREACH            
## Local.Govt.AreaMACKAY            ***
## Local.Govt.AreaMARANOA           ***
## Local.Govt.AreaMAREEBA           *  
## Local.Govt.AreaMORETON BAY       ***
## Local.Govt.AreaMOUNT ISA            
## Local.Govt.AreaMURWEH            ***
## Local.Govt.AreaNOOSA             ***
## Local.Govt.AreaNORTH BURNETT     ***
## Local.Govt.AreaREDLAND           ***
## Local.Govt.AreaROCKHAMPTON       ***
## Local.Govt.AreaSCENIC RIM        ***
## Local.Govt.AreaSOMERSET          ***
## Local.Govt.AreaSOUTH BURNETT     ***
## Local.Govt.AreaSOUTHERN DOWNS    ***
## Local.Govt.AreaSUNSHINE COAST    ***
## Local.Govt.AreaTABLELANDS           
## Local.Govt.AreaTOOWOOMBA         ***
## Local.Govt.AreaTORRES               
## Local.Govt.AreaTOWNSVILLE        ***
## Local.Govt.AreaWESTERN DOWNS     ***
## Local.Govt.AreaWHITSUNDAY        ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 674500 on 6628 degrees of freedom
## Multiple R-squared:  0.9898, Adjusted R-squared:  0.9897 
## F-statistic: 1.397e+04 on 46 and 6628 DF,  p-value: < 2.2e-16
plot(dollars_model, 
     pch=16, 
     which=1)

hist(dollars_model$residuals,
     breaks=1000,
     xlim=c(-2e6,2e6))

sd(dollars_model$residuals)
## [1] 672180.9

Hypothesis testing

A hypothesis test (or more precisely, the p-value from a hypothesis test) tells you how probable data at least as extreme as yours would be if the null model (your null hypothesis) were true.

For example, I may have rolled a red die 10 times and got the following results:

redDie<-c(6, 4, 4, 6, 2, 1, 4, 6, 5, 2)

Let’s say that my “null model” is that the mean of many rolls will be 3.5, i.e., the result we would expect, on average, if the die is fair.
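The 3.5 here is just the average of the faces of a fair die:

```r
mean(1:6)   # expected value of one roll of a fair six-sided die
## [1] 3.5
```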

mean(redDie)
## [1] 4

Is this data consistent with our hypothesis?

redDieTTest<-t.test(redDie, alternative="two.sided", mu = 3.5 ,conf.level=0.95)
redDieTTest
## 
##  One Sample t-test
## 
## data:  redDie
## t = 0.86603, df = 9, p-value = 0.409
## alternative hypothesis: true mean is not equal to 3.5
## 95 percent confidence interval:
##  2.693943 5.306057
## sample estimates:
## mean of x 
##         4
names(redDieTTest)
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"
redDieTTest$parameter
## df 
##  9
redDieTTest$p.value
## [1] 0.4089688
redDieTTest$conf.int
## [1] 2.693943 5.306057
## attr(,"conf.level")
## [1] 0.95
redDieTTest$estimate
## mean of x 
##         4
redDieTTest$null.value
## mean 
##  3.5
redDieTTest$alternative
## [1] "two.sided"
redDieTTest
## 
##  One Sample t-test
## 
## data:  redDie
## t = 0.86603, df = 9, p-value = 0.409
## alternative hypothesis: true mean is not equal to 3.5
## 95 percent confidence interval:
##  2.693943 5.306057
## sample estimates:
## mean of x 
##         4

I then rolled a blue die 10 times and got the following results from it:

blueDie<-c(5, 1, 4, 2, 3, 1, 5, 4, 4, 1)

Are the mean values over ten rolls different between the red and the blue die (within statistical uncertainty)?

redblueDieTTest<-t.test(redDie,blueDie, alternative="two.sided", mu = 0 ,conf.level=0.95)
redblueDieTTest
## 
##  Welch Two Sample t-test
## 
## data:  redDie and blueDie
## t = 1.291, df = 17.78, p-value = 0.2132
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6288085  2.6288085
## sample estimates:
## mean of x mean of y 
##         4         3

Interpretation of the result

The p-value (0.2132) is greater than the significance level of 5% (1 - 0.95), so the null hypothesis that both dice have the same mean is plausible. By contrast, if we draw 10 samples from a normal distribution with mean 9 and standard deviation 6 and test the null hypothesis that the true mean is 0, the p-value is tiny, so we reject that null hypothesis:

v <- rnorm(10,9, 6)
t.test(v-0, alternative="two.sided", conf.level=0.95)  
## 
##  One Sample t-test
## 
## data:  v - 0
## t = 6.8047, df = 9, p-value = 7.864e-05
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   7.285841 14.542481
## sample estimates:
## mean of x 
##  10.91416

Homogeneous variances

Before proceeding with a t-test, it is good practice to compare the sample variances of the two groups, using Fisher's F-test to check for homoskedasticity (homogeneity of variances). In R you can do this as follows:

a <- c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b <- c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)
var.test(a,b)
## 
##  F test to compare two variances
## 
## data:  a and b
## F = 2.1028, num df = 9, denom df = 9, p-value = 0.2834
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5223017 8.4657950
## sample estimates:
## ratio of variances 
##           2.102784

We obtained a p-value greater than 0.05, so we can assume that the two variances are homogeneous. Indeed, we can compare the value of F obtained with the tabulated critical value of F for alpha = 0.05, with 9 degrees of freedom in the numerator and 9 in the denominator, using the function qf(p, df.num, df.den):

qf(0.95, 9, 9)
## [1] 3.178893
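The statistic reported by var.test() is simply the ratio of the two sample variances, so we can carry out the comparison against the critical value ourselves:

```r
a <- c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b <- c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)
F.ratio <- var(a) / var(b)   # the F statistic reported by var.test()
F.ratio < qf(0.95, 9, 9)     # below the critical value, so do not reject
## [1] TRUE
```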

Note that the computed value of F is less than the tabulated value of F, which leads us to accept the null hypothesis of homogeneity of variances. NOTE: the F-test here uses only one tail, so with a confidence level of 95% we use p = 0.95. Conversely, the t-distribution has two tails, so in R's function qt(p, df) we insert p = 0.975 when testing a two-tailed alternative hypothesis.
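For example, the two-tailed t critical value at the 5% level with 18 degrees of freedom (the df of the two-sample test below) is:

```r
qt(0.975, df = 18)   # two-tailed 5% critical value, 18 degrees of freedom
## [1] 2.100922
```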

Two-sample hypothesis test

We use a two-sample Student's t-test to compare the means of two groups under the assumption that both samples are random, independent, and come from normally distributed populations with unknown but equal variances. (When the normality assumption does not hold, the non-parametric alternative is the Wilcoxon-Mann-Whitney test.) For samples with homogeneous variances set var.equal = TRUE, and for independent samples set paired = FALSE (you can omit this because the function works on independent samples by default).

t.test(a,b, var.equal=TRUE, paired=FALSE)
## 
##  Two Sample t-test
## 
## data:  a and b
## t = -0.94737, df = 18, p-value = 0.356
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.93994   4.13994
## sample estimates:
## mean of x mean of y 
##     174.8     178.2

Conducting Chi-Squared Test of Independence

The chi-squared test of independence is a non-parametric method appropriate for testing independence between two categorical variables. The dataset used here is survey (from the MASS package): the Smoke column records the students' smoking habits, while the Exer column records their exercise level. The allowed values in Smoke are "Heavy", "Regul" (regularly), "Occas" (occasionally) and "Never". For Exer, they are "Freq" (frequently), "Some" and "None".

Question: test the hypothesis that the students' smoking habit is independent of their exercise level at the 0.05 significance level (i.e., explore the relationship between the two variables).

First, make the contingency table of the two variables (the students' smoking habit against their exercise level) using the table() function in R:

library(MASS)       # load the MASS package 
tbl <- table(survey$Smoke, survey$Exer) 
tbl<-tbl[c(2,3,4,1),]                 # the contingency table
tbl
##        
##         Freq None Some
##   Never   87   18   84
##   Occas   12    3    4
##   Regul    9    1    7
##   Heavy    7    1    3

See the relationship using barplot:

barplot(tbl, 
        beside=T, 
        legend=T,
        xlab = "Exercise",
        ylab = "Number of People")

Solution: we apply the chisq.test() function to the contingency table tbl and read off the p-value.

chisq.test(tbl) 
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 5.4885, df = 6, p-value = 0.4828

Answer: As the p-value 0.4828 is greater than the .05 significance level, we do not reject the null hypothesis that the smoking habit is independent of the exercise level of the students.

Enhanced solution: the warning message in the solution above is due to the small cell values in the contingency table. To avoid the warning, we can combine the second and third columns of tbl ("None" and "Some"), save the result in a new table named ctbl, and then apply chisq.test() to ctbl instead.

ctbl = cbind(tbl[,"Freq"], tbl[,"None"] + tbl[,"Some"]) 
ctbl 
##       [,1] [,2]
## Never   87  102
## Occas   12    7
## Regul    9    8
## Heavy    7    4
EX_SM <- chisq.test(ctbl) 
EX_SM
## 
##  Pearson's Chi-squared test
## 
## data:  ctbl
## X-squared = 3.2328, df = 3, p-value = 0.3571
attributes(EX_SM)
## $names
## [1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed" 
## [7] "expected"  "residuals" "stdres"   
## 
## $class
## [1] "htest"

If the assumptions of the chi-squared test are not met, we may consider using Fisher's exact test, which is a non-parametric alternative to the chi-squared test.

fisher.test(tbl, conf.int=T, conf.level=0.99)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  tbl
## p-value = 0.4138
## alternative hypothesis: two.sided

Anova: One-Way Analysis of Variance (ANOVA)

ANOVA is a parametric method appropriate for comparing the means of two or more independent populations.

data(InsectSprays)
attach(InsectSprays)
str(InsectSprays)
## 'data.frame':    72 obs. of  2 variables:
##  $ count: num  10 7 20 14 14 12 10 23 17 20 ...
##  $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

We’re going to use a data set called InsectSprays. 6 different insect sprays (1 Independent Variable with 6 levels) were tested to see if there was a difference in the number of insects found in the field after each spraying (Dependent Variable).

InsectSprays
##    count spray
## 1     10     A
## 2      7     A
## 3     20     A
## 4     14     A
## 5     14     A
## 6     12     A
## 7     10     A
## 8     23     A
## 9     17     A
## 10    20     A
## 11    14     A
## 12    13     A
## 13    11     B
## 14    17     B
## 15    21     B
## 16    11     B
## 17    16     B
## 18    14     B
## 19    17     B
## 20    17     B
## 21    19     B
## 22    21     B
## 23     7     B
## 24    13     B
## 25     0     C
## 26     1     C
## 27     7     C
## 28     2     C
## 29     3     C
## 30     1     C
## 31     2     C
## 32     1     C
## 33     3     C
## 34     0     C
## 35     1     C
## 36     4     C
## 37     3     D
## 38     5     D
## 39    12     D
## 40     6     D
## 41     4     D
## 42     3     D
## 43     5     D
## 44     5     D
## 45     5     D
## 46     5     D
## 47     2     D
## 48     4     D
## 49     3     E
## 50     5     E
## 51     3     E
## 52     5     E
## 53     3     E
## 54     6     E
## 55     1     E
## 56     1     E
## 57     3     E
## 58     2     E
## 59     6     E
## 60     4     E
## 61    11     F
## 62     9     F
## 63    15     F
## 64    22     F
## 65    15     F
## 66    16     F
## 67    13     F
## 68    10     F
## 69    26     F
## 70    26     F
## 71    24     F
## 72    13     F
mean(count[spray=="A"])
## [1] 14.5
# Look at the mean
tapply(count, spray, mean) 
##         A         B         C         D         E         F 
## 14.500000 15.333333  2.083333  4.916667  3.500000 16.666667
tapply(count, spray, var)
##         A         B         C         D         E         F 
## 22.272727 18.242424  3.901515  6.265152  3.000000 38.606061
tapply(count, spray, length)
##  A  B  C  D  E  F 
## 12 12 12 12 12 12
boxplot(count ~ spray)

Run the ANOVA. H0: the mean count is the same for all sprays.

ANOVA.Sprays <- aov(count ~ spray, data = InsectSprays)
summary(ANOVA.Sprays)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## spray        5   2669   533.8    34.7 <2e-16 ***
## Residuals   66   1015    15.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value is smaller than 0.05, we reject H0 and conclude that not all of the means are the same.

plot(ANOVA.Sprays)

attributes(ANOVA.Sprays)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"        
## 
## $class
## [1] "aov" "lm"
ANOVA.Sprays$coefficients
## (Intercept)      sprayB      sprayC      sprayD      sprayE      sprayF 
##  14.5000000   0.8333333 -12.4166667  -9.5833333 -11.0000000   2.1666667

Run a multiple-comparison procedure to determine which means differ from the others. Tukey's HSD test performs all possible pairwise comparisons of the group means:

TukeyHSD(ANOVA.Sprays)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = count ~ spray, data = InsectSprays)
## 
## $spray
##            diff        lwr       upr     p adj
## B-A   0.8333333  -3.866075  5.532742 0.9951810
## C-A -12.4166667 -17.116075 -7.717258 0.0000000
## D-A  -9.5833333 -14.282742 -4.883925 0.0000014
## E-A -11.0000000 -15.699409 -6.300591 0.0000000
## F-A   2.1666667  -2.532742  6.866075 0.7542147
## C-B -13.2500000 -17.949409 -8.550591 0.0000000
## D-B -10.4166667 -15.116075 -5.717258 0.0000002
## E-B -11.8333333 -16.532742 -7.133925 0.0000000
## F-B   1.3333333  -3.366075  6.032742 0.9603075
## D-C   2.8333333  -1.866075  7.532742 0.4920707
## E-C   1.4166667  -3.282742  6.116075 0.9488669
## F-C  14.5833333   9.883925 19.282742 0.0000000
## E-D  -1.4166667  -6.116075  3.282742 0.9488669
## F-D  11.7500000   7.050591 16.449409 0.0000000
## F-E  13.1666667   8.467258 17.866075 0.0000000

A visual display of the pairwise comparisons of the mean insect count for the sprays:

plot(TukeyHSD(ANOVA.Sprays), las=1)

The Kruskal-Wallis one-way analysis of variance is a non-parametric equivalent to the one-way ANOVA:

kruskal.test(count ~ spray, data=InsectSprays)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  count by spray
## Kruskal-Wallis chi-squared = 54.691, df = 5, p-value = 1.511e-10

The p-value is again tiny, so we reject the null hypothesis.

Normal Distribution

A common distribution used in several statistical tests is the normal distribution. A standard normal distribution has standard deviation \(\sigma=1\) and mean \(\mu=0\); if you don't provide these arguments, these values are assumed.

pnorm(q) gives the integrated probability up to q standard deviations away from the mean

pnorm(q = 2)
## [1] 0.9772499
pnorm(q = 0)
## [1] 0.5

qnorm(p) gives the number of standard deviations away from the mean to get an integrated probability of p

qnorm(p = 0.95)
## [1] 1.644854
qnorm(p = 0.025, mean = 2, sd = 0.5)
## [1] 1.020018
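pnorm() and qnorm() are inverses of each other, so round-tripping a probability returns it unchanged:

```r
pnorm(qnorm(0.95))                                            # recovers 0.95
## [1] 0.95
pnorm(qnorm(0.025, mean = 2, sd = 0.5), mean = 2, sd = 0.5)   # recovers 0.025
## [1] 0.025
```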

rnorm(n) gives a vector of n random samples from the standard normal distribution (or you can specify a mean and a standard deviation)

rnorm(n = 4, mean = 2, sd = 2)
## [1] 5.2152317 0.5935958 2.4702616 3.7358675

Visual ways of checking the distribution of data are histogram, boxplots, and Q-Q plot.

Q-Q Normal Plots

When A and B have the same distribution, the Q-Q plot is a 45-degree straight line. In many instances we need to know whether a sample comes from a normal distribution, so we draw the Q-Q plot of A against the standard normal distribution. NOTE: this will be a straight line if the distribution of A is normal with any mean and standard deviation. There is an R function that does all of this: qqnorm().

v <- rnorm(100, 3, 2)
qqnorm(v)
qqline(v, col = 2)

The Q-Q line is drawn so that it passes through the first and third quartiles. All of the points should lie close to this line when the sample is normal.

v <- rnorm(100, 0, 1)
qqnorm(v)
qqline(v, col = 2)

v <- rnorm(10000, 0, 1)
qqnorm(v)
qqline(v, col = 2)

Q-Q Plot to Compare Two Samples

v1 <- rnorm(100, 0, 1)
v2 <- rnorm(150, 3, 1)
qqplot(v1, v2)